Abstract

This article describes research undertaken in order to design a methodology for the reticular 
representation of knowledge of a specific discourse community. To achieve this goal, a 
representative corpus of the scientific production of the members of this discourse 
community (Universidad Politécnica de Valencia, UPV) was created. The article presents
the practical analysis (frequency, keyword, collocation and cluster analysis) that was carried 
out in the initial phases of the study aimed at establishing the theoretical and practical 
background and framework for our matrix and network analysis of the scientific discourse of 
the UPV. In the methodology section, the processes that have allowed us to extract from the 
corpus the linguistic elements needed to develop co-occurrence matrices, as well as the 
computer tools used in the research, are described. From these co-occurrence matrices, 
semantic networks of subject and discipline knowledge were generated. Finally, based on 
the results obtained, we suggest that it may be viable to extract and to represent the 
intellectual capital of an academic institution using corpus linguistics methods in 
combination with the formulations of network theory. 

Keywords: corpus linguistics, co-occurrence matrices, semantic networks, knowledge 
discovery. 

Summary

In this article we describe the research carried out to design a methodology for the
reticular representation of the knowledge generated within an institution, based on a
representative corpus of the scientific production of the members of that discourse
community, the Universidad Politécnica de Valencia. To this end, we present the work carried
out in the initial phases of the study, aimed at establishing the theoretical and practical
framework for our analysis. The methodology section describes the computer tools used, as
well as the processes that allowed us to obtain from the corpus the elements that led to the
development of co-occurrence matrices, from which semantic networks of disciplinary
knowledge were generated. Finally, on the basis of the results obtained, we confirm the
viability of extracting and representing intellectual capital using the principles of corpus
linguistics in combination with the formulations of network theory.

Keywords: corpus linguistics, academic articles, co-occurrence matrices, semantic networks,
knowledge discovery.

I. INTRODUCTION 

This article proposes a model for the application of network analysis to the field of corpus
linguistics as a method for representing the knowledge that is generated in our academic
discourse community. The initial idea is a simple one: the words that make up a corpus are
the nodes of an interrelated linguistic network. The article analyzes the discourse of science
and technology by studying keywords and their co-selection in research articles belonging to
a corpus of 1,376 articles (a total of 6,104,323 words). All of the articles were taken from
specialist journals indexed in the Science Citation Index (SCI®), were written by our academic
staff, and represent the work of a single discourse community.

The hypothesis from which we started is that language, in this case written text, is the
vehicle for the exchange and transmission of knowledge between the members of a discourse
community. Our aim is to extract the knowledge that has been shaped in scientific articles,
to analyze it and to organize it so that it can be represented. To achieve this, we analyzed
our selected corpus of journal articles, studying at both the microscopic and the macroscopic
level certain lexico-grammatical characteristics that realize networks of meaning, that is,
the knowledge that is generated in a university context. In this academic scenario,
terminology extraction and analysis becomes a central issue.

According to the Firthian tradition, collocations manifest certain lexical and semantic
affinities that go beyond grammatical restrictions. Sinclair (1991: 170) refers to collocation
as “the occurrence of two or more words within a short space of each other in a text”;
this definition can cover co-selection between lexical and/or grammatical items. From the
point of view of network theory, the concept can be explained as follows: if two units a and
b are related in terms of collocational statistics (or are simply frequent bigrams), as are
units b and c, then there is an implicit and indirect relation between a and c, even though
no direct collocational relationship between a and c has been confirmed. We have been
cautious in our assumptions in this study, using only the collocations/bigrams of those words
that had been obtained as keywords and that may be considered cohesive nodes because they
are related at least three times with other keywords (Hoey, 1991).
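
The network reading of collocation described above can be illustrated with a minimal Python sketch. It is not the program used in the study: the function names (build_graph, indirect_relations, cohesive_nodes) and the toy pairs are assumptions made for illustration, and "related at least three times" is read here as "linked to at least three different keywords", which is only one possible interpretation.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(pairs):
    """Build undirected adjacency sets from attested collocation pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_relations(graph):
    """Return pairs (a, c) that share a neighbour b but are not directly linked."""
    implied = set()
    for b, neighbours in graph.items():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in graph[a]:
                implied.add((a, c))
    return implied


def cohesive_nodes(graph, min_links=3):
    """Return keywords linked to at least `min_links` other keywords (assumed reading)."""
    return {node for node, nbrs in graph.items() if len(nbrs) >= min_links}


# Invented toy pairs, for illustration only.
pairs = [("apical", "end"), ("basal", "end"), ("stylar", "end"), ("apical", "bud")]
graph = build_graph(pairs)
print(sorted(indirect_relations(graph)))
print(cohesive_nodes(graph))
```

In this toy graph 'apical' and 'basal' are never attested together, but both collocate with 'end', so the pair is reported as an indirect relation; only 'end' reaches the three-link threshold.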

The article explains how we generated matrices of co-occurrences of keywords and
how we visualized the co-appearance of these keywords in 23 different areas of specialized
knowledge and for the corpus in its totality. For this task, we had to use various computer
programs. Wordsmith was used to extract keywords from the corpus. An initial listing of
keywords was obtained by comparing our corpus (UPV Corpus) of English research articles
with a corpus of general English (British National Corpus). At the same time, listings of
keywords of each one of the 23 specialist areas were obtained by comparing the initial
listing of keywords extracted from the UPV corpus with each of the specialist areas (key-key
words). The matrices of keywords were made by means of a program we developed using
Perl and dumped onto spreadsheets. At this point, each one of the matrices was transferred
to the Ucinet program and, finally, the networks were visualized with the Netdraw utility.
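
Keyword extraction by comparison with a reference corpus can be sketched as follows. The source does not state which keyness statistic or cut-off was applied, so this sketch assumes the log-likelihood statistic, which is commonly used for such comparisons, and a cut-off of 15.13 (the chi-square critical value for p < 0.0001 with one degree of freedom). The function names and the toy frequency lists are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word given its frequency in two corpora."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * ll


def keywords(study_counts, ref_counts, min_ll=15.13):
    """Return (word, score) pairs that are unusually frequent in the study corpus."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    result = []
    for word, f_study in study_counts.items():
        f_ref = ref_counts.get(word, 0)
        # Keep only positive keywords: relatively more frequent in the study corpus.
        if f_study / size_study <= f_ref / size_ref:
            continue
        score = log_likelihood(f_study, size_study, f_ref, size_ref)
        if score >= min_ll:
            result.append((word, score))
    return sorted(result, key=lambda item: item[1], reverse=True)


# Invented toy frequency lists, for illustration only.
study = {"immunoassay": 40, "the": 6000, "of": 3000}
reference = {"immunoassay": 2, "the": 60000, "of": 30000, "house": 400}
print(keywords(study, reference))
```

With these toy counts only 'immunoassay' passes the cut-off; 'the' and 'of' have almost identical relative frequencies in both lists and score close to zero.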

A high-priority objective of the article is to show how these intratextual and intertextual
networks generated from the keywords offer granular fragments of knowledge that are
dispersed within, throughout and across texts, and contain a high semantic load. Advances in
network theory not only provide a suitable framework of integration, but they may open new
perspectives in the study of language and the organization of knowledge. Corpus linguistics
in combination with network analysis may become a technique applicable to the discovery
of knowledge and, in our particular case, disciplinary and subject knowledge.

II. METHOD 

In the study, we have been able to observe how words used in scientific terminology, which
depend on a specialized field of knowledge, generally display low frequency statistics in
the normal discourse of general English. These specialized terms help to define the
communities that use them in the same way as these communities define their terms. The
information compiled in the different stages of the research has made use of the notions of
word frequency, keywords and lexico-grammatical relations, that is to say, the
lexico-grammatical phenomena of collocation, semantic prosody and colligation. Similarly, on
the basis of statistical relevance, we have evaluated the degree of interaction, that is, the
associations that take place between certain lexico-grammatical items in our research.

Besides the intratextual study, certain intertextual aspects have been considered that have
allowed us to detect variations which are produced within the same genre. For this purpose,
we have worked on the development of computer applications designed to suit our needs. We
have been able to compare our tools with existing commercial tools that have similar aims,
such as Wordsmith Tools. Both the advantages and the weaknesses of these tools, as well as
the results obtained with them, have been compared. These questions have been addressed by
analyzing our UPV corpus as a whole, as well as each one of the specialized knowledge areas
within the corpus.

Once the intratextual analysis had been concluded, in the following stage an analysis was
carried out that allowed us to quantify and to represent concrete aspects of variation and
recursivity at the intertextual level. We started from the premise that, over and above
individual texts, there exist textual macrostructures that various texts share or that are
generic to them, and that it is possible to access these macrostructures by means of corpus
linguistic methods.

Authors such as Kristeva (1966), Barthes (1970) or Bakhtin (1986) understand
intertextuality in the sense that a text is always tied to other texts or previous experiences
and shows prospection to future texts or wordings and statements. The intertextual acts of
retrospection and prospection mean that the interactive force of a text extends back to
previous texts and forward to future texts. De Beaugrande and Dressler (1981) affirm that
any text must fulfill the requirement of intertextuality in order to be considered a text
and that, in addition, intertextuality determines the way that the use of a certain text
depends on the knowledge of other texts. For these authors, the term intertextuality refers to
the dependency relation that is established between the processes of production and
reception of a certain text and the knowledge that the participants in the communicative
interaction already have of other previous texts related to the text in question.

Along the same lines, Fairclough (2002) defends an intertextual perspective for the
analysis, for example, of pre-constructed phrases and fixed collocations.

Once the framework for this phase of the study had been delimited, we defined the following
specific objectives:

• To represent the frequency of each keyword in each of the different documents that
make up the areas of knowledge within the UPV Corpus

• To represent the distribution of each of these keywords in the different sections that
traditionally form part of the research article (IMRD)

• To relate and to represent the interactions between terms according to their frequency
rate

• To compare and to represent the degree of recursivity that is produced with regard to
identical language patterns of different length (clusters) in each one of the analyzed texts

The work was carried out in four successive stages that are shown in the following
table:


Matrix generation: Intratextual and intertextual analysis

Matrix 1: Keyword distribution per document
Matrix 2: Keyword distribution per article sections
Matrix 3: Keyword combinations
Matrix 4: Cluster distribution (3 to 8 words) per document

Table 1. Matrix Generation

The basic scheme that was followed for each one of the matrices is as follows:

Matrix 1

                    Doc 1       Doc 2       ...     Doc 140
Word 1              Frequency
Word 2
Word 3
...
Word n

Matrix 2

          Abstract   Introduction   Methods   Results   Discussion   Conclusion
Word 1    Freq.
Word 2
Word 3
...
Word n

Matrix 3

                    Result      System      Word 3      ...     Word 100
Word 1              Frequency
Word 2
Word 3
...
Word n

Matrix 4

                    Doc 1       Doc 2       ...     Doc 140
Clusters 3 Words    Frequency
Clusters 4 Words
...
Clusters n Words (up to n = 8)

Table 2. Scheme for matrix generation


The matrices were generated from the lists of keywords of each area of knowledge and 
from keywords in the corpus in its totality. A software application that we developed 
ourselves in Perl was used for this and each of the matrices was transferred to a spreadsheet. 
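
The Perl application itself is not reproduced in the article. As an illustration of the kind of processing involved, the following Python sketch builds a Matrix 1 style table (keyword frequency per document) and serializes it as CSV so that it can be opened in a spreadsheet. The tokenization rule, the function names and the two toy documents are assumptions, not the authors' implementation.

```python
import csv
import io
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")  # assumed tokenization: lowercase alphabetic strings


def keyword_document_matrix(documents, keywords):
    """Return (header, rows): one row per keyword, one column per document."""
    names = sorted(documents)
    counts = {name: Counter(TOKEN.findall(documents[name].lower())) for name in names}
    header = ["keyword"] + names
    rows = [[kw] + [counts[name][kw] for name in names] for kw in keywords]
    return header, rows


def to_csv(header, rows):
    """Serialize the matrix as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented toy documents, for illustration only.
docs = {
    "doc1.txt": "The apical end of the shoot; the basal end.",
    "doc2.txt": "Bud formation at the apical bud.",
}
header, rows = keyword_document_matrix(docs, ["apical", "end", "bud"])
print(to_csv(header, rows))
```

The same function yields a Matrix 2 style table if the dictionary keys are article sections instead of file names.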

The following phase consisted of evaluating and determining which computer program
would be the most suitable for representing, in reticular form, the intratextual and
intertextual information that had been obtained in the form of matrices. Ucinet 6 proved to
meet the requirements for such aims. For this reason, using the Netdraw utility of the tool,
we proceeded to carry out different representations that allowed us to draw conclusions
about the graphical representation of knowledge from the matrices of co-occurrences of
keywords.

Ucinet is a tool for the representation of social networks. The analysis of social
networks constitutes a method for evaluating informal networks by means of the
representation of the relations between people, teams, departments or even whole
organizations. It studies the form in which individuals or organizations are connected and
defines the position that these occupy in the network, the groups and global structure of the
network, knowledge and information flows within the network and network relations which
involve reciprocal influence. For a number of years, this kind of analysis has been applied to
investigate ongoing collaboration between authors or institutions in scientific publications.
Examples of this kind of research initiative can be found in Newman (2001), Molina and
Muñoz (2002), Sanz (2003) and González Alcaide et al. (2006).

III. RESULTS 

In Matrix 1 pairs of keywords from each of the documents obtained from the individual 
areas that make up the UPV Corpus are represented. By this method, those pairs that are 
specific to a single article as well as those that are repeated in more than one text can be 
identified. The matrix we have selected as an example corresponds to the area of 
Neuroscience. As it is a knowledge domain with a reduced number of articles, it is possible 
to visualize a screenshot in which the distribution of the items in the spreadsheet is shown.

Figure 1. Screenshot Matrix 1: Bi-grams per document


The network we present below (Fig. 2) demonstrates how the majority of bi-grams are
usually grouped around single documents in our corpus. In contrast, some of them,
especially those with a lower semantic load, share intermediate positions as they are found
in more than one article. The results obtained after the first stage in our analysis lead us to
confirm that pairs of keywords with high semantic density tend to concentrate in individual
texts, which denotes the specificity of the articles analyzed.

Figure 2. Network example: Bi-grams per document


The following matrix, Matrix 2, represents the distribution of keywords across the
different sections into which academic research articles tend to be structured (Abstract,
Introduction, Methods, Results, Discussion, Conclusion). The information it provides offers
clear indications of what is known as the ‘aboutness’ of the texts that make up the
specialized knowledge subdomains or sub-corpora. When analyzing keyword lists from
the individual areas in previous stages of our study, we obtained global data referring to
implicit knowledge. At this stage, we have the necessary tools to interpret quantitatively
how that lexical information with knowledge content is structured in the standard sections of
academic articles. This issue has been addressed by authors from diverse specialist areas,
who base their studies on text mining to discover knowledge that is present in a large number
of texts and that would be impossible to extract manually by exhaustive reading alone. It
should be emphasized that the majority of studies have conducted their analysis only by
processing the Abstract section of articles. A similar analysis, based on keyword
distribution in academic article sections, was developed by Shah et al. (2003). The reason
for concentrating on the Abstract section lies, on the one hand, in the availability of
abstracts online and, on the other, in the large amount of information that is condensed in
them. Nevertheless, the Results section is the one that covers the greatest quantity of
information within the article, whereas Abstracts contain the greatest density of information
(Schuemie et al., 2004).

If we observe Table 3, taken from the area of Chemistry, showing the distribution of
keywords in each section, we will discover that in the Abstract section terms like
‘compound’, ‘polymeric’, ‘immunoassay’ and ‘pesticides’ are repeated significantly
(although this section is much shorter than the others). In the Methods and Results sections,
terms that were found to be statistically relevant are, for example: ‘curve’, ‘fig’, ‘observed’,
‘concentration/s’, ‘range’, ‘calibration’, which are used to express findings after a process or
model of investigation. The information we obtained by analyzing the occurrence and
distribution of keywords in article sections leads us to conclude that there exists a certain
preference or concentration in the use of certain terms in the different sections of academic
articles. As Hoey would say, article sections are lexically primed for certain words. We
could even state that these can be grouped under categories since they tend to show
common lexical and/or grammatical features.


Word                  Abstract  Introduction  Methods  Results  Conclusion
 1. temperature            284           430      386      310         315
 2. peak                    47           131      226      218         122
 3. potential               86           125      156      172         117
 4. sample                 120           264      230      221         205
 5. curves                  21            81      148      149          82
 6. water                  233           228      234      251         206
 7. ph                      70           139      170      164         108
 8. peaks                   30            33      109      105          65
 9. fluorescence            46            51       55       56          50
10. elisa                   31            30       39       46          51
11. presence               113            94      124      130         132
12. compounds              100            55       49       80          83
13. compound                45            63       47       62          80
14. fig                     70           411      742      752         440
15. determination           77            50       41       26          70
16. antibody                19            28       53       21          32
17. curve                   16            39       85       94          66
18. acid                   139           126      121      126         121
19. experiments             82           164       87       83          87
20. chemical               116            76       56       64          68
21. organic                 88            40       47       44          50
22. assay                   24            28       47       51          37
23. chimica                 36            19       26       23          25
24. experimental           119           145      145      141         122
25. found                   94            82      133      116         181
26. solution               148           301      273      206         226
27. observed                90           123      191      214         178
28. interaction             75            27       45       59          77
29. polymeric               35            15       12       14          37
30. immunosensors           22            21       10        9          24
31. immunosensor            25            10       14       15          28
32. solvents                33            29       34       23          29
33. concentration           78           108      118      150         128
34. range                   79            96      100      118          79
35. samples                104           184      168      132         210
36. liquid                  66            71       39       42          43
37. solutions               82           153       85       74          52
38. mobility                39            11       19       16          14
39. adsorbed                22            26       31       22          25
40. reported                75            66       68       75          78
41. prepared                60           111       37       19          32
42. pesticide               25            14       12       10          11
43. immunoassays            29            12       10        6          10
44. measured                63           114      120       89          52
45. immunoassay             25             4        9       17          18
46. buffer                  17            58       73       46          23
47. binding                 62            27       18       46          26
48. pesticides              42             9        7        5          11
49. concentrations          22            47       74       77          32
50. calibration             14            27       31       26          26


Table 3. Distribution of 50 keywords across document sections (Chemistry) 
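
One way to read Table 3 is to convert each row into shares of the keyword's total occurrences, so that the section in which a term concentrates becomes visible. The following Python sketch does this for three rows copied from Table 3. The function names are assumptions, and the shares are not normalized by section length because section sizes are not reported in the article.

```python
SECTIONS = ["Abstract", "Introduction", "Methods", "Results", "Conclusion"]

# Raw frequencies copied from Table 3 (Chemistry).
TABLE_3_ROWS = {
    "pesticides": [42, 9, 7, 5, 11],
    "curve": [16, 39, 85, 94, 66],
    "calibration": [14, 27, 31, 26, 26],
}


def section_shares(counts):
    """Return each section's share of a keyword's total occurrences."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


def dominant_section(counts):
    """Return (section name, share) for the section with the highest share."""
    shares = section_shares(counts)
    best = max(range(len(shares)), key=shares.__getitem__)
    return SECTIONS[best], shares[best]


for word, counts in TABLE_3_ROWS.items():
    name, share = dominant_section(counts)
    print(f"{word}: {name} ({share:.1%})")
```

For 'pesticides', 42 of the 74 occurrences (about 57%) fall in the Abstract column, whereas 'curve' peaks in Results with 94 of 300 occurrences, which is consistent with the concentration of terms by section described above.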


The distribution of the terms collected in the form of matrices can be visualized using
the Netdraw utility. When clicking on each of the terms, we will see their number of links and
the different categories they connect to (the nodes of the network), in this case, the different
sections of an article. This procedure allows us to visualize how a term is contained in one
or more article sections.

Figure 3. Network example of keywords per document section

In Matrix 3 the combinations between keywords (bi-grams) have been represented. The
objective of this type of analysis is to determine how the same element is related to a greater
or lesser extent with other relevant elements within the subject area. For this reason, the
matrix was designed with the same elements on both the vertical and horizontal axes:
the keywords from each area. The numbers in the cells indicate the number of combinations
that take place between these pairs. The distribution of the information by this method
allows the researcher to detect whether the co-occurrences are unidirectional or
bidirectional, as well as the number of repetitions. For example, we could verify that the
combination ‘apical end’ is very frequent (95 repetitions), as is ‘basal end’ (87
appearances), whereas the combination ‘apical bud’ (3 instances) is much less common.
However, at this stage, we proceeded by asking the following:

1. Are the combinations ‘apical end’ and ‘basal end’ unidirectional or bidirectional?

2. Which other terms does the term ‘apical’ interact/combine with?

3. Which other combinations are found with ‘basal’ and ‘end’?

Figure 4. Matrix example of combinations between keywords
(Agriculture & Biological Sciences)


When exploring the table, we observe that ‘apical end’ is only used in one direction,
whereas ‘basal end’ is bidirectional, even though ‘end basal’ is less frequent (3 repetitions).
In response to the second question, we can see that ‘apical’ also co-appears with ‘bud’ and
‘shoot’. Moreover, we find the combinations ‘adventitious bud’ and ‘bud formation’. We
could expand the interaction or co-selection of keywords further in this way.
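
The distinction between unidirectional and bidirectional combinations can be made explicit by counting ordered pairs of adjacent keywords. The Python sketch below is an illustration only: the tokenization rule, the function names and the toy sentence are assumptions, and sentence boundaries are ignored for simplicity.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")  # assumed tokenization: lowercase alphabetic strings


def directed_pairs(text, keywords):
    """Count adjacent keyword pairs as ordered (left word, right word) tuples."""
    tokens = TOKEN.findall(text.lower())
    keyset = set(keywords)
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left in keyset and right in keyset:
            counts[(left, right)] += 1
    return counts


def direction(counts, a, b):
    """Classify the relation between a and b by the orders in which they occur."""
    forward, backward = counts[(a, b)], counts[(b, a)]
    if forward and backward:
        return "bidirectional"
    if forward or backward:
        return "unidirectional"
    return "none"


def as_matrix(counts, keywords):
    """Square matrix in the style of Matrix 3: rows are left words, columns right words."""
    return [[counts[(row, col)] for col in keywords] for row in keywords]


# Invented toy text, for illustration only.
text = ("The apical end and the basal end were cut. "
        "At the end basal tissue grew from the apical bud.")
terms = ["apical", "basal", "end", "bud"]
counts = directed_pairs(text, terms)
print(direction(counts, "apical", "end"), direction(counts, "basal", "end"))
print(as_matrix(counts, terms))
```

Because the matrix is not symmetric, reading a row gives the words that follow a keyword and reading a column gives the words that precede it.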

Likewise, when looking into the matrix for the combinations of ‘end’ with other terms,
examples such as ‘end table’ and ‘stylar end’ are found. We discover that ‘stylar’ does not
co-appear with other keywords. ‘Basal’ is combined with ‘medium’ (3 repetitions) and with
‘diet’ (20 instances). When a knowledge domain with a large number of texts is taken for
analysis, the matrix generated is also of great dimensions. Consequently, when trying to
represent the content of this complex matrix graphically, we discover that the resulting
network is a complex one (which reflects the complexity of language) in which all the
existing bonds are displayed.

Figure 5. Network example of keyword combinations (Agriculture & Biological Sciences)


In this maze of interactions, the computer tool offers the option of applying a filter
with a minimum number of appearances, so that the lines/links below the established
number become transparent. It is also possible, by selecting the ‘ego’ option in the
tool bar, to position on one of the elements and visualize solely the relationships that this
participant in the network displays.

Figure 6. Network example of keyword combinations: 1 term
(Agriculture & Biological Sciences)


Similarly, the utility allows us to perform multiple queries by selecting the required
elements, for example the n most frequent keywords, and to represent their relationships. In
Figure 7, the network generated from the first 10 terms in the matrix is shown.



Figure 7. Network example of keyword combinations: 10 terms (Agriculture & Biological Sciences)

It is up to the researcher to decide on the selection of items to be visualized depending
on the scope of her or his analysis. On the one hand, s/he might want to visualize all the
relationships/bonds a term presents, thus obtaining a conceptual dispersion; that is to say,
the network will cover extensively the different concepts within the documents analyzed. On
the other hand, the linguist could focus only on those combinations that are strongest, that
is to say, most frequent, and which therefore have a higher conceptual and semantic density.
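
The two selection strategies just described (keeping only the strongest combinations, or keeping every bond of a single term) can be sketched in Python. This is not Netdraw's implementation; the function names are hypothetical, and the edge list contains only the six combinations whose frequencies are quoted in the text above, so it is a partial illustration of the Agriculture & Biological Sciences matrix.

```python
# Frequencies quoted in the text for Agriculture & Biological Sciences (partial data).
EDGES = {
    ("apical", "end"): 95,
    ("basal", "end"): 87,
    ("apical", "bud"): 3,
    ("end", "basal"): 3,
    ("basal", "medium"): 3,
    ("basal", "diet"): 20,
}


def filter_edges(edges, min_weight):
    """Keep only combinations that occur at least `min_weight` times."""
    return {pair: weight for pair, weight in edges.items() if weight >= min_weight}


def ego_network(edges, ego):
    """Keep only the combinations in which the chosen term takes part."""
    return {pair: weight for pair, weight in edges.items() if ego in pair}


print(filter_edges(EDGES, 20))
print(ego_network(EDGES, "apical"))
```

A threshold of 20 leaves three combinations ('apical end', 'basal end', 'basal diet'), while the ego view of 'apical' keeps both of its bonds regardless of their frequency.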

What we are dealing with could be called social networks of language, in which the
individuals or actors are not the members of a group but terms, and the links are the
relationships among them. Metaphorically speaking, as in social networks, we are concerned
with the type of interactions between individuals: the number of times our participants,
that is, our keywords, meet certain other participants in the system will imply a more or
less significant/relevant relationship (in our case, conceptual and semantic density).
However, the total number of participants that relate to the same actor, regardless of the
number of times they meet, will imply a greater complexity in the network, although its
strength or consistency may be lower. With this analysis we have developed a lexical
framework that has allowed us to generate maps or networks representing the explicit
knowledge being produced in our academic discourse community.
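
The contrast between the number of partners of a term and the number of times it meets them corresponds to what network analysis usually calls degree and strength. The following Python sketch computes both for the six combinations whose frequencies are quoted earlier in this section; the function name is an assumption and the data are partial, so the figures only illustrate the idea.

```python
from collections import defaultdict

# Frequencies quoted in the text for Agriculture & Biological Sciences (partial data).
EDGES = {
    ("apical", "end"): 95,
    ("basal", "end"): 87,
    ("apical", "bud"): 3,
    ("end", "basal"): 3,
    ("basal", "medium"): 3,
    ("basal", "diet"): 20,
}


def degree_and_strength(edges):
    """Return term -> (number of distinct partners, total co-occurrence count)."""
    partners = defaultdict(set)
    strength = defaultdict(int)
    for (a, b), weight in edges.items():
        partners[a].add(b)
        partners[b].add(a)
        strength[a] += weight
        strength[b] += weight
    return {term: (len(partners[term]), strength[term]) for term in partners}


for term, (degree, total) in sorted(degree_and_strength(EDGES).items()):
    print(f"{term}: {degree} partners, {total} co-occurrences")
```

On these figures 'basal' relates to three different terms with 113 co-occurrences in total, whereas 'end' relates to only two terms but accumulates 185, which illustrates how a more dispersed term can have weaker ties.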

The following matrix, Matrix 4, contains clusters or accumulations (strings ranging
from 3 to 8 words) extracted from each article in the different specialized knowledge areas
in the UPV Corpus, and also from the Corpus as a whole.



                    ABVol84-     ABVol85-     ABVol86-     ABVol87-     AE&EVol95-
                    6(1999).txt  1(2000).txt  1(2000).txt  6(2001).txt  1(2003).txt
IN ORDER TO         0            0            0            0            0
THE EFFECT OF       2            8            9            1            1
THE NUMBER OF       13           15           16           12           0
DUE TO THE          1            3            0            1            1
THE END OF          2            4            0            10           0
END OF THE          20           6            11           1            0
THE PRESENCE OF     2            11           17           1            0
THE USE OF          0            0            1            0            1
A FUNCTION OF       0            0            0            0            0
WAS CARRIED OUT     0            0            0            0            0
AT THE END          2            4            0            6            0
THE INFLUENCE OF    3            5            4            3            0
AS A FUNCTION       0            0            0            0            0
CAN BE OBSERVED     0            0            0            0            0
ON THE OTHER        1            0            0            1            0
THE OTHER HAND      1            0            0            1            0
ACCORDING TO THE    1            2            1            0            4
CHANGES IN THE      0            0            0            2            0
EFFECT OF THE       0            3            3            0            8
THE PERCENTAGE OF   0            4            1            4            0
ARE SHOWN IN        0            0            0            0            0
IN TERMS OF         0            0            0            0            0
RELATED TO THE      1            0            0            3            0


Table 4. Example of cluster distribution (3 words) across documents (Agriculture & Biological Sciences) 


When analyzing the terms in the UPV corpus in previous stages by extracting strings of
identical recurrent patterns, we could verify that, depending on the span we set, we will
obtain structures with different lexical and grammatical features. In shorter sequences, like
the ones shown in the table above, we detected expressions that are shared by more than one
area, since they are frequent expressions in academic articles. In most cases, they are
patterns that tend to be repeated in the majority of texts. The following network facilitates
the visualization of this aspect:



Figure 8. Network example of clusters (3 words) per document (Agriculture & Biological Sciences) 


However, as strings become longer, their semantic content is observed to be higher
and, therefore, so is the conceptual information they convey.

embryo recovery and in vitro development
embryos recovered in does with at
for growth rate from weaning to
for r and v lines respectively
for the explants incubated in the
from birth to the first week
from the marginal posterior density b
gold coated and viewed in the
growth rate from weaning to slaughter
had a significant effect on the

Table 5. Example of cluster distribution (6 words) across documents
(Agriculture & Biological Sciences)

The resulting network from such an analysis demonstrates that these clusters, as they 
contain denser and domain-specific conceptual information, are more characteristic of a 
limited number of articles. 
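
The observation that short clusters are shared across texts while long clusters stay within a few articles can be checked by counting, for each cluster, the number of documents in which it occurs. The Python sketch below is a minimal illustration: the tokenization rule, the function names and the two toy documents are assumptions and do not reproduce the Perl application used in the study.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z]+")  # assumed tokenization: lowercase alphabetic strings


def clusters(text, n):
    """Count every contiguous string of n words in a text."""
    tokens = TOKEN.findall(text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cluster_document_counts(documents, n):
    """Matrix 4 style table: cluster -> {document name: frequency}."""
    table = defaultdict(dict)
    for name, text in documents.items():
        for cluster, freq in clusters(text, n).items():
            table[cluster][name] = freq
    return table


def document_spread(documents, n):
    """Number of documents in which each cluster of length n occurs."""
    return {c: len(per_doc) for c, per_doc in cluster_document_counts(documents, n).items()}


# Invented toy documents, for illustration only.
docs = {
    "doc_a.txt": "The effect of the treatment on growth rate from weaning to slaughter was small.",
    "doc_b.txt": "The effect of the diet was measured at the end of the trial.",
}
print(document_spread(docs, 3)["the effect of"])
print(max(document_spread(docs, 6).values()))
```

In the toy data the 3-word cluster 'the effect of' occurs in both documents, whereas no 6-word cluster is shared, mirroring the pattern shown in Figures 8 and 9.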



Figure 9. Network example of cluster distribution (6 words) per document (Agriculture & Biological Sciences)

We conclude our analysis of the results obtained with the methodological approach we
have implemented with a statement about the complex nature of language: “Language is
clearly an example of a complex dynamical system. It exhibits highly intricate network
structures at all levels (phonetic, lexical, syntactic, semantic) and this structure is to some
extent shaped and reshaped by millions of language users over long periods of time, as
they adapt and change them to their needs as part of ongoing local interactions” (Solé et al.,
2005: 3).

IV. CONCLUSION 

Our intention has been to represent discourse as a network of meanings. In this attempt to
outline a model for the generation of semantic networks based on the idea of social
networks (Barabasi, 2002; Barabasi & Jeong, 2002), and making use of the necessary
computing tools to achieve our aim (Borgatti, 2003), we have carried out the study taking as
our point of reference the principles of corpus linguistics, an empirical method that has been
shown to be an adequate procedure for obtaining the necessary information on language
and knowledge.

By obtaining concordance lines, collocates, colligates, bigrams and clusters, it was
possible to discover lexico-grammatical aspects of the language used by members of the 
discourse community being studied. As a result of this procedure, we could detect those 
recurrent patterns common to the different texts analyzed and, consequently, characteristic 
of the language that they represent. 

The resulting matrices of lexical and grammatical co-selection examples have opened
the doors for us to work towards a semantic network of disciplinary knowledge. Starting off
from the idea of social networks, and making use of Netdraw, we analyzed our UPV Corpus
as if it were an organization whose members are the different lexico-grammatical units and
the structures into which they are integrated. In the analysis of social networks, one is
interested in the consistency of the relations between the actors of the organization; that
is to say, their stronger or weaker ties. In a similar manner, in our model we were interested
in the weight of the associations between the linguistic elements that make up the language
network.

Our contribution in this aspect has consisted of designing a procedure by which 
different intertextual and intratextual aspects of the analyzed documents can be obtained in 
such a form that one can appreciate the existing bonds between the diverse actors (elements 
of the corpus) that have been submitted to analysis. In this sense, the ideas of Hoey (1991, 
2001) and his conception of sets of texts as network formations have been present when 
formulating the hypothesis that language is recursive and forms a network of meanings that 
carry the semantic content of texts. The establishment and verification of a relationship 
between these networks of meaning and knowledge has been one of the principal objectives 
of the investigation. 

However, at this stage, we should look back at our point of departure, our initial
hypothesis, and conclude this article by affirming that the study and representation of
explicit knowledge through language, because of its complexity, needs to be limited to
specialist knowledge areas of manageable dimensions. The fundamental problem resides in
knowing how to formalize what is really significant out of the enormous amount of
information that can be obtained from a corpus. In other words, there is a need for
quantitative parameters to determine what should be considered significant and relevant
information.